On Cluster Analysis Based on Contingency Tables

Author

  • F. El-Mouadib
Abstract

Zytkow and Zembowicz (1993) proposed a system for cluster analysis based on two-way contingency tables. In its original form, the system is confined to categorical and discrete multivariate data. The aim is to build a classification tree in which only binary splits are allowed at each level. Within a one-level taxonomy formation, a systematic search is performed for pairs of variables whose association is "strong enough" in some sense (the original algorithm uses measures based on the χ² statistic). If such pairs are found, one of them is selected on a rather heuristic basis to guide a split. Owing to its algorithmic simplicity, the system can be considered suitable for data mining applications. In the talk, the proposed methodology, including some refinements, will be presented. In particular, (i) simple measures of association that admit a clear interpretation, introduced by Goodman and Kruskal, will be used in lieu of χ²-like statistics; (ii) instead of relying on heuristics, the choice of the pair to guide a split will be based on maximization of the so-called category utility (as suggested, e.g., in Fisher and Hapanyengwi (1993), and based either on entropy or on the Gini index); (iii) a discretization of continuous variables will be proposed, so that such variables can be handled as well. The search for pairs of variables conducive to a one-level taxonomy is mostly based on the classical measure. The strength of association within such a pair is then increased by suitably aggregating the values of the two variables. If an (approximate) equivalence relationship between the aggregated variables is thus obtained, it can be used to guide the split (both the aggregation and the split can be aided by correspondence analysis of the contingency table involved).
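The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. The function names (`contingency_table`, `chi_square`, `goodman_kruskal_tau`, `gini_category_utility`) are hypothetical. The choice of Goodman and Kruskal's tau, rather than another of their measures, is an assumption. The category utility shown is one common Gini-based form and may differ from the variant used in the talk.

```python
from collections import Counter


def contingency_table(xs, ys):
    """Two-way table of counts; rows follow sorted values of xs, columns of ys."""
    rows, cols = sorted(set(xs)), sorted(set(ys))
    counts = Counter(zip(xs, ys))
    return [[counts[(r, c)] for c in cols] for r in rows]


def chi_square(table):
    """Pearson chi-square statistic for independence of rows and columns.

    Assumes every row and column total is positive, which holds for
    tables built by contingency_table().
    """
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def goodman_kruskal_tau(table):
    """Proportional reduction in Gini variation of the column variable
    once the row variable is known (0 = no association, 1 = perfect)."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_sq = sum((sum(col) / n) ** 2 for col in zip(*table))
    if col_sq == 1.0:  # column variable is constant; tau is undefined, return 0
        return 0.0
    within = sum(cell ** 2 / row_tot[i]
                 for i, row in enumerate(table) for cell in row) / n
    return (within - col_sq) / (1.0 - col_sq)


def gini_category_utility(records, labels):
    """Gini-based category utility of a partition of records into clusters.

    records: list of tuples of categorical values; labels: cluster per record.
    """
    n = len(records)
    n_attrs = len(records[0])

    def sum_sq(rows):
        # Sum over attributes and values of squared relative frequencies.
        total = 0.0
        for a in range(n_attrs):
            freq = Counter(r[a] for r in rows)
            total += sum((c / len(rows)) ** 2 for c in freq.values())
        return total

    baseline = sum_sq(records)
    clusters = {}
    for rec, lab in zip(records, labels):
        clusters.setdefault(lab, []).append(rec)
    gain = sum(len(rows) / n * (sum_sq(rows) - baseline)
               for rows in clusters.values())
    return gain / len(clusters)
```

In a search of the kind described, one would compute the association measure for every pair of variables, keep the pairs above a threshold, and then prefer the candidate split with the largest category utility. The threshold and the exact selection rule are left open here because the abstract does not specify them.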

Similar resources

Analysis of Dynamic Longitudinal Categorical Data in Incomplete Contingency Tables Using Capture-Recapture Sampling: A case Study of Semi-Concentrated Doctoral Exam

Abstract. In this paper, dynamic longitudinal categorical data and estimation of their parameters in incomplete contingency tables are evaluated. To apply the proposed method, a study has been conducted on the data of the semi-concentrated doctoral exam of the National Organization for Educational Testing (NOET). The results of studies such as the obtained confidence intervals and calculating t...

Plain Answers to Several Questions about Association/Independence Structure in Complete/Incomplete Contingency Tables

In this paper, we develop some results based on the relational model (Klimova et al., 2012), which permits a decomposition of the logarithm of expected cell frequencies under a log-linear type model. These results imply plain answers to several questions in the context of analyzing contingency tables. Moreover, determination of the design matrix and the hypothesis-induced matrix of the model will be discusse...

Directional Clustering Tests Based on Nearest Neighbor Contingency Tables

Spatial interaction between two or more classes or species has important implications in various fields and causes multivariate patterns such as segregation or association. Segregation occurs when members of a class or species are more likely to be found near members of the same class or conspecifics; while association occurs when members of a class or species are more likely to be found near m...

New Tests of Spatial Segregation Based on Nearest Neighbor Contingency Tables

The spatial clustering of points from two or more classes (or species) has important implications in many fields and may cause the spatial patterns of segregation and association, which are two major types of spatial interaction between the classes. The null patterns we consider are random labeling (RL) and complete spatial randomness (CSR) of points from two or more classes, which is called CS...

Tests in contingency tables as regression tests

Applied researchers often use tests based on contingency tables in preliminary data analysis and diagnostic testing. We show that many such tests may alternatively be implemented by testing for coefficient restrictions in linear regression systems (as a rule, employing the Wald test). This unifies the theories of regression analysis and contingency tables, sheds more light on intuitive conte...

Publication date: 1999